03_A_TSGraphs_Timeplots_Scatterplots_Seasonalplots

Libraries

library(fpp3)
── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
✔ tibble      3.2.1     ✔ tsibble     1.1.3
✔ dplyr       1.1.2     ✔ tsibbledata 0.4.1
✔ tidyr       1.3.0     ✔ feasts      0.3.1
✔ lubridate   1.9.2     ✔ fable       0.3.3
✔ ggplot2     3.4.3     ✔ fabletools  0.3.3
── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date()    masks base::date()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::group_rows()  masks kableExtra::group_rows()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ tsibble::setdiff()   masks base::setdiff()
✖ tsibble::union()     masks base::union()
library(fma)
Loading required package: forecast
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 

Attaching package: 'forecast'
The following object is masked from 'package:fabletools':

    accuracy
library(readr)

References

This is not original material but a compilation of the following sources with some additional examples and clarifications introduced by the professor:

  1. Hyndman, R.J., & Athanasopoulos, G., 2021. Forecasting: principles and practice. 3rd edition.
  2. Additional material provided by the book authors
  3. The Datasaurus Dozen - Link
  4. Shumway, R.; Stoffer, D; Time Series Analysis and Its Applications with R examples.

Framework and packages:

We will work with the fpp3 package, developed by Prof. Rob Hyndman and his team. They are one of the leading contributing groups in the time series community, so do not forget their name.

This package extends the tidyverse framework to time series. To do so, it loads several other packages under the hood (see the output of the following cell):

library(fpp3)

Additionally, for this particular session we will use the packages GGally, fma and patchwork. Install them if you do not have them already:

install.packages("GGally", dependencies = TRUE) # to generate a scatterplot matrix
install.packages("fma", dependencies = TRUE) # to load the US Treasury bills dataset
install.packages("patchwork", dependencies = TRUE) # used to manage the relative location of ggplots
library(patchwork) # used to manage the relative location of ggplots
library(GGally)
library(fma) # to load the US Treasury bills dataset

Time plots and basic time series patterns

Time series patterns

  Pattern    Description
  Trend      Long-term increase OR decrease in the data
  Seasonal   Periodic pattern due to the calendar (quarter, month, day of the week...)
  Cyclic     Rises and falls that are NOT FIXED IN PERIOD (typically with a duration of at least 2 years)

Seasonal or cyclic?

  Seasonal                                       Cyclic
  Pattern of constant length                     Pattern of variable length
  Shorter than a cyclic pattern                  Longer than a seasonal pattern
  Magnitude less variable than a cyclic one      Magnitude more variable than a seasonal one

Let us look at some examples:

Example 1 - trended and seasonal data

aus_production %>%
  
  # Filter data to show appropriate timeframe
  filter((year(Quarter) >= 1980 & year(Quarter)<= 2000)) %>%
  
  autoplot(Electricity) + 
  
  # Scale the x axis adequately
  # scale_x_yearquarter used because the index is a yearquarter column
  scale_x_yearquarter(date_breaks = "1 year",
                      minor_breaks = "1 year") +
  
  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))

Strong trend with a change in the slope of the trend around 1990.

autoplot() as a wrapper around ggplot2 objects

The function autoplot() of the library fpp3 is nothing but a wrapper around ggplot2 objects. Depending on the object fed to autoplot(), it will exhibit one behavior or another, combining different ggplot objects.

In the example above, autoplot() is essentially a wrapper around geom_line(). It is just that using autoplot() is much more convenient, and it is a good idea to have such a function in a time-series-dedicated library such as fpp3.

aus_production %>% 
  
  filter((year(Quarter) >= 1980 & year(Quarter)<= 2000)) %>%
  
  # The two lines below are equivalent to autoplot(Electricity)
  ggplot(aes(x = Quarter, y = Electricity)) + 
  geom_line() +
  
  # Scale the x axis adequately
  # scale_x_yearquarter used because the index is a yearquarter column
  scale_x_yearquarter(date_breaks = "1 year",
                      minor_breaks = "1 year") +
  
  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))

Example 2 - (exercise) - trended and seasonal data with recessions

Scale the x-grid of the following graph so that the end of each year is clearly visible

aus_production %>% autoplot(Bricks)
Warning: Removed 20 rows containing missing values (`geom_line()`).
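A possible solution sketch (the exact break spacing is a stylistic choice; what matters is that the grid marks every year end):

```r
aus_production %>%
  autoplot(Bricks) +
  # Major ticks every 5 years, minor ticks every year,
  # so the end of each year is clearly visible on the grid
  scale_x_yearquarter(date_breaks = "5 years",
                      minor_breaks = "1 year") +
  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))
```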

Example 3 - trend embedded within a cycle

US treasury bill contracts on the Chicago market for 100 consecutive trading days in 1981.

See Treasury Bills.

The object ustreas is loaded along with the fma library, which we loaded at the beginning of the notebook.

ustreas
Time Series:
Start = 1 
End = 100 
Frequency = 1 
  [1] 91.57 91.56 91.40 91.39 91.20 90.90 90.99 91.17 90.98 91.11 91.20 90.92
 [13] 90.73 90.79 90.86 90.83 90.80 90.21 90.10 90.10 89.66 89.81 89.40 89.34
 [25] 89.16 88.71 89.28 89.36 89.64 89.53 89.32 89.14 89.41 89.53 89.16 88.98
 [37] 88.50 88.63 88.62 88.76 89.07 88.47 88.41 88.57 88.23 87.93 87.88 88.18
 [49] 88.04 88.18 88.78 89.29 89.14 89.14 89.42 89.26 89.37 89.51 89.66 89.39
 [61] 89.02 89.05 88.97 88.57 88.44 88.52 87.92 87.71 87.52 87.37 87.27 87.07
 [73] 86.83 86.88 87.48 87.30 87.87 87.85 87.83 87.40 86.89 87.38 87.48 87.21
 [85] 87.02 87.10 87.02 87.07 86.74 86.35 86.32 86.77 86.77 86.49 86.02 85.52
 [97] 85.02 84.42 85.29 85.32

ustreas is a ts object (time series). This is another possible format for time series in R, but remember we are going to use tsibbles to allow for seamless functionality with the fpp3 library. You can convert the ts object to a tsibble using as_tsibble():

ustreas_tsibble <- as_tsibble(ustreas)
ustreas_tsibble
# A tsibble: 100 x 2 [1]
   index value
   <dbl> <dbl>
 1     1  91.6
 2     2  91.6
 3     3  91.4
 4     4  91.4
 5     5  91.2
 6     6  90.9
 7     7  91.0
 8     8  91.2
 9     9  91.0
10    10  91.1
# ℹ 90 more rows
autoplot(ustreas_tsibble)
Plot variable not specified, automatically selected `.vars = value`

Looks like a downward trend, but it is part of a much longer cycle (not shown). You do not want to make forecasts with this data too long into the future.

scale_x_continuous: adjusting the grid when the index is numeric.

Examining our tsibble reveals that the index is numeric (the trading day):

ustreas_tsibble %>% head(5)
# A tsibble: 5 x 2 [1]
  index value
  <dbl> <dbl>
1     1  91.6
2     2  91.6
3     3  91.4
4     4  91.4
5     5  91.2

For this reason, we need to use the function scale_x_continuous(). In addition, we need to generate specific sequences marking the tick positions. For example, if we want major ticks every 10 days and minor ticks every 5 days:

major_ticks_seq = seq(0, max(ustreas_tsibble$index), 10)
major_ticks_seq
 [1]   0  10  20  30  40  50  60  70  80  90 100
minor_ticks_seq = seq(0, max(ustreas_tsibble$index), 5)
minor_ticks_seq
 [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90
[20]  95 100
ustreas_tsibble %>%
  autoplot() +
  scale_x_continuous(breaks = major_ticks_seq,
                     minor_breaks = minor_ticks_seq)
Plot variable not specified, automatically selected `.vars = value`

Example 4 - exercise. Yearly data and cyclic behavior

ANNUAL DATA is not susceptible to calendar variations: it is sampled at the same point every year. Calendar variations might be present in the underlying process, but they are not captured by yearly data. Therefore, there is no seasonality in yearly data; if anything, there could be cyclic behavior.

Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. This data contains trade records for all areas of the company. We are going to analyse the number of Canadian Lynx pelts traded.

pelt %>% head(5)
# A tsibble: 5 x 3 [1Y]
   Year  Hare  Lynx
  <dbl> <dbl> <dbl>
1  1845 19580 30090
2  1846 19600 45150
3  1847 19610 49150
4  1848 11990 39520
5  1849 28040 21230
lynx <- pelt %>% select(Year, Lynx)
lynx %>% head(5)
# A tsibble: 5 x 2 [1Y]
   Year  Lynx
  <dbl> <dbl>
1  1845 30090
2  1846 45150
3  1847 49150
4  1848 39520
5  1849 21230

Exercise

Plot the time series ensuring we have major ticks every 10 years and minor ticks every 5 years:
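A possible solution sketch, following the same scale_x_continuous approach as the ustreas example (the tick sequences below are one choice among many):

```r
# Major ticks every 10 years, minor ticks every 5 years
major_ticks_seq <- seq(1845, max(lynx$Year), 10)
minor_ticks_seq <- seq(1845, max(lynx$Year), 5)

# The Year index is numeric, hence scale_x_continuous
lynx %>%
  autoplot(Lynx) +
  scale_x_continuous(breaks = major_ticks_seq,
                     minor_breaks = minor_ticks_seq)
```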

Example 5 (exercise) - Cyclic + seasonal behavior

Monthly sales of new one-family houses sold in the USA since 1973.

The ts object hsales is loaded along with the fma library.

hsales_tsibble <- hsales %>% as_tsibble()

Create a timeplot that has major ticks every 5 years and minor ticks every year
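A possible solution sketch. Since as_tsibble() converts the monthly ts to a tsibble with a yearmonth index (and default measured column named value), scale_x_yearmonth applies:

```r
hsales_tsibble %>%
  autoplot(value) +
  # Major ticks every 5 years, minor ticks every year
  scale_x_yearmonth(date_breaks = "5 years",
                    minor_breaks = "1 year") +
  theme(axis.text.x = element_text(angle = 90))
```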

Example 6 - multiple seasonality

Half-hourly electricity demand in England and Wales from Monday 5 June 2000 to Sunday 27 August 2000. Discussed in Taylor (2003), and kindly provided by James W Taylor. Units: Megawatts

Again, taylor is a time-series object (ts):

class(taylor)
[1] "msts" "ts"  

However, since the series is sampled every half-hour (that is, we are dealing with date-time information), using as_tsibble() directly does not adequately convert this time series to a tsibble:

# Column index represented as a double! This is not what we want.
taylor %>% as_tsibble()
# A tsibble: 4,032 x 2 [9.9999999960505e-07]
   index value
   <dbl> <dbl>
 1  1    22262
 2  1.00 21756
 3  1.01 22247
 4  1.01 22759
 5  1.01 22549
 6  1.01 22313
 7  1.02 22128
 8  1.02 21860
 9  1.02 21751
10  1.03 21336
# ℹ 4,022 more rows

To adequately transform the taylor ‘ts’ object to a tsibble, we are going to define a sequence of date-time objects that match the length of the taylor object. Based on the description and assuming the measurements on the 5th of June 2000 start at midnight, the code below turns the ts object into a tsibble. It is not necessary that you fully understand this code, but it is explained in detail:

# Define the start time based on information about the Taylor series
start_time <- as.POSIXct("2000-06-05 00:00:00", tz = "UTC")

# We specify by = 1800 because the lowest unit in our date-time object
# start_time is seconds (hh:mm:ss). Therefore this date-time sequence increases
# in steps of seconds, and there are 1800 seconds in half an hour (30 * 60).
# We also specify that the sequence must have the same length as the taylor
# object
datetime_seq <- start_time + seq(from = 0, by = 1800, length.out = length(taylor))

# Convert the time series to a tsibble
taylor_tsibble <- 
  tibble(
    index = datetime_seq,
    value = as.numeric(taylor)
  ) %>%
  as_tsibble(index = index)

# Check the resulting tsibble
taylor_tsibble
# A tsibble: 4,032 x 2 [30m] <UTC>
   index               value
   <dttm>              <dbl>
 1 2000-06-05 00:00:00 22262
 2 2000-06-05 00:30:00 21756
 3 2000-06-05 01:00:00 22247
 4 2000-06-05 01:30:00 22759
 5 2000-06-05 02:00:00 22549
 6 2000-06-05 02:30:00 22313
 7 2000-06-05 03:00:00 22128
 8 2000-06-05 03:30:00 21860
 9 2000-06-05 04:00:00 21751
10 2000-06-05 04:30:00 21336
# ℹ 4,022 more rows

Now let us create a quick plot of this object:

autoplot(taylor_tsibble)
Plot variable not specified, automatically selected `.vars = value`

  • Two types of seasonality
    • Daily pattern
    • Weekly pattern
  • If we had data over a few years: we would see an annual pattern
  • If we had data over a few decades: we might see a longer cyclic pattern

To inspect these patterns more closely, I am going to filter for the first four weeks in the time series. To attain this, I am going to create an auxiliary week column and then filter based on that column:

# Create auxiliary column signalling the yearweek
taylor_tsibble <- 
  taylor_tsibble %>% 
  mutate(
    week = yearweek(index)
  )

# Extract first week and compute 4th week
week1 <-  taylor_tsibble$week[1]
week4 <- week1 + 3

# Filter for first four weeks and store in new object
taylor_tsibble_4w <- 
  taylor_tsibble %>% 
    filter(
      week >= week1, 
      week <= week4
    )

# Depict result
taylor_tsibble_4w %>% 
  
  autoplot() + 
  
  # Used because index is date-time
  scale_x_datetime(
    breaks = "1 week",
    minor_breaks = "1 day"
  )
Plot variable not specified, automatically selected `.vars = value`

Here we can see the patterns more closely.

IMPORTANT: the electricity demand does not always follow such an extremely regular pattern. There might be weekly seasonality, but it is not necessarily so regular.

Half-hourly electricity demand for Victoria (Australia):

vic_elec %>% head(5)
# A tsibble: 5 x 5 [30m] <Australia/Melbourne>
  Time                Demand Temperature Date       Holiday
  <dttm>               <dbl>       <dbl> <date>     <lgl>  
1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   

scale_x_datetime: adjusting the grid when the index is a date-time object

Let us filter the electricity consumption in Victoria vic_elec to include only datapoints of January 2012:

elec_jan <- vic_elec %>%
              mutate(month = yearmonth(Time)) %>%
              filter(month == yearmonth("2012 Jan"))

Create a time-plot of this data using scale_x_datetime to establish major breaks of 1 week and minor breaks of 1 day.
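A possible solution sketch, mirroring the scale_x_datetime call used for the taylor series above:

```r
elec_jan %>%
  autoplot(Demand) +
  # The Time index is a date-time, hence scale_x_datetime
  scale_x_datetime(breaks = "1 week",
                   minor_breaks = "1 day") +
  theme(axis.text.x = element_text(angle = 90))
```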

Example 7 - Trend + seasonality (multiplicative)

PBS contains monthly Medicare Australia prescription data. We are going to select the data corresponding to drugs of ATC2 category A10. For further info run ?PBS in the console.

PBS %>%
  filter(ATC2 == "A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost)) %>%
  
  # Comment on this assignment operator
  mutate(Cost = TotalC / 1e6) -> a10
a10 %>% head(5)
# A tsibble: 5 x 3 [1M]
     Month  TotalC  Cost
     <mth>   <dbl> <dbl>
1 1991 Jul 3526591  3.53
2 1991 Aug 3180891  3.18
3 1991 Sep 3252221  3.25
4 1991 Oct 3611003  3.61
5 1991 Nov 3565869  3.57
autoplot(a10, Cost) +
  labs(y = "$ (millions)",
       title = "Australian antidiabetic drug sales") +
  scale_x_yearmonth(breaks = "1 year") +
  theme(axis.text.x = element_text(angle = 90))

Scatter plots

Two variables

vic_elec
# A tsibble: 52,608 x 5 [30m] <Australia/Melbourne>
   Time                Demand Temperature Date       Holiday
   <dttm>               <dbl>       <dbl> <date>     <lgl>  
 1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
 2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
 3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
 4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
 5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
 6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE   
 7 2012-01-01 03:00:00  3694.        20.1 2012-01-01 TRUE   
 8 2012-01-01 03:30:00  3562.        19.6 2012-01-01 TRUE   
 9 2012-01-01 04:00:00  3433.        19.1 2012-01-01 TRUE   
10 2012-01-01 04:30:00  3359.        19.0 2012-01-01 TRUE   
# ℹ 52,598 more rows
p1 <- vic_elec %>%
  filter(year(Time) == 2014) %>%
  autoplot(Demand) +
  labs(y = "GW",
       title = "Half-hourly electricity demand: Victoria")

p2 <- vic_elec %>%
  filter(year(Time) == 2014) %>%
  autoplot(Temperature) +
  labs(
    y = "Degrees Celsius",
    title = "Half-hourly temperatures: Melbourne, Australia"
  )

p1 / p2

vic_elec %>%
  filter(year(Time) == 2014) %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(x = "Temperature (degrees Celsius)",
       y = "Electricity demand (GW)")
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Note: interpretation of correlation coefficients

The correlation coefficient only measures the strength of linear relationship. For example, the correlation for the electricity demand and temperature data shown in the previous figure is only 0.26, but their non-linear relationship is actually much stronger:

# Compute correlation coefficient
round(cor(vic_elec$Temperature, vic_elec$Demand), 2)
[1] 0.26

Remember that a correlation coefficient of 0 does not imply that there is no relationship between two variables. It just implies that there is no linear relationship.

Note: beware that there are very bad sources on statistics and data science on the internet:

Let us see a specific toy example that you should keep in your head forever:

df <- tibble(
              x = seq(-4, 4, 0.05),
              y = x^2
            )

df %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

cor(df$x, df$y)
[1] 1.494299e-16

QUESTION: Why is it not exactly 0?

Other examples of variables with a correlation coefficient of 0:

Examples of corr coefficient = 0 (Wikipedia)

Remember as well that two variables may have a very similar correlation coefficient yet look very different when plotted. The example below (ref [1]) shows scatterplots of different pairs of variables, all of which have a correlation coefficient of 0.82:

Examples of variables having the same correlation coefficient. From [1]

The image above is known as Anscombe's quartet. There are infinitely many datasets with the same correlation coefficient. As an example, see the funny .gif image below:

Datasaurus - from ref. 3

All these examples are meant to make you aware of how important it is to depict a scatterplot of the variables and not rely only on the correlation coefficient, which is a summary only of their linear relationship.

Several variables - scatterplot matrix

visitors <- tourism %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

visitors
# A tsibble: 640 x 3 [1Q]
# Key:       State [8]
   State Quarter Trips
   <chr>   <qtr> <dbl>
 1 ACT   1998 Q1  551.
 2 ACT   1998 Q2  416.
 3 ACT   1998 Q3  436.
 4 ACT   1998 Q4  450.
 5 ACT   1999 Q1  379.
 6 ACT   1999 Q2  558.
 7 ACT   1999 Q3  449.
 8 ACT   1999 Q4  595.
 9 ACT   2000 Q1  600.
10 ACT   2000 Q2  557.
# ℹ 630 more rows
# Check distinct values of column "State"
distinct(visitors, State)
# A tibble: 8 × 1
  State             
  <chr>             
1 ACT               
2 New South Wales   
3 Northern Territory
4 Queensland        
5 South Australia   
6 Tasmania          
7 Victoria          
8 Western Australia 
visitors %>%
  pivot_wider(values_from=Trips, names_from=State)
# A tsibble: 80 x 9 [1Q]
   Quarter   ACT `New South Wales` `Northern Territory` Queensland
     <qtr> <dbl>             <dbl>                <dbl>      <dbl>
 1 1998 Q1  551.             8040.                 181.      4041.
 2 1998 Q2  416.             7166.                 314.      3968.
 3 1998 Q3  436.             6748.                 528.      4594.
 4 1998 Q4  450.             7282.                 248.      4203.
 5 1999 Q1  379.             7585.                 185.      4332.
 6 1999 Q2  558.             7054.                 366.      4824.
 7 1999 Q3  449.             6724.                 501.      5018.
 8 1999 Q4  595.             7136.                 248.      4350.
 9 2000 Q1  600.             7296.                 206.      4413.
10 2000 Q2  557.             6445.                 360.      4344.
# ℹ 70 more rows
# ℹ 4 more variables: `South Australia` <dbl>, Tasmania <dbl>, Victoria <dbl>,
#   `Western Australia` <dbl>
visitors %>%
  pivot_wider(values_from=Trips, names_from=State) %>%
  GGally::ggpairs(columns = 2:9) + 
  theme(axis.text.x = element_text(angle = 90))

We can add a loess line to the scatterplots to help identify non-linearities with the following code:

visitors %>%
  pivot_wider(values_from=Trips, names_from=State) %>%
  GGally::ggpairs(columns = 2:9, lower = list(continuous = wrap("smooth_loess", color="lightblue"))) + 
  theme(axis.text.x = element_text(angle = 90))

Exercise: scatterplot matrix

The code snippet below loads the dataset soi_recruitment. Run the code and select the file soi_recruitment.csv when prompted to pick a file. You will find this in the folder ZZ_Datasets, on Google Drive. This dataset contains two variables and has been taken from reference 4.

  • SOI: Southern Oscillation Index (SOI) for a period of 453 months ranging over the years 1950-1987. The Southern Oscillation Index (SOI) is a standardized index based on the observed sea level pressure (SLP) differences between Tahiti and Darwin, Australia. The SOI is one measure of the large-scale fluctuations in air pressure occurring between the western and eastern tropical Pacific (i.e., the state of the Southern Oscillation) during El Niño and La Niña weather episodes. For more information about this variable, see link
  • Recruitment: Recruitment (index of the number of new fish) for a period of 453 months ranging over the years 1950-1987. Recruitment is loosely defined as an indicator of new members of a population. The data has been gathered by Dr. Roy Mendelssohn, from the Pacific Environmental Fisheries Group.
soi_recruitment <- 
   read_csv(file.choose()) %>% 
   mutate(ym = yearmonth(index)) %>% 
   select(ym, SOI, recruitment) %>% 
   as_tsibble(index = ym)

# Compute desired lags
for (i in seq(1, 8)) {
  lag_name <- paste0("SOI_l", i)
  soi_recruitment[[lag_name]] = lag(soi_recruitment[["SOI"]], i)
}

# Reorder
soi_recruitment <- 
  soi_recruitment %>% 
  select(ym, recruitment, everything())

Generate the scatterplot matrix below:

Question: do you detect any non-linear relationship between recruitment and SOI or any of its lags SOI_li that is not adequately captured by the correlation coefficient?
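One possible way to generate the scatterplot matrix, reusing GGally::ggpairs as in the visitors example (assuming the column order produced by the code above: ym, recruitment, SOI, SOI_l1, ..., SOI_l8):

```r
# Scatterplot matrix of recruitment, SOI and its 8 lags
# (columns 2:11 skip the ym index column)
soi_recruitment %>%
  GGally::ggpairs(columns = 2:11) +
  theme(axis.text.x = element_text(angle = 90))
```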

Seasonal plots - gg_season

Difference with a time-plot: the x-axis no longer extends indefinitely over time, but only over the duration of a single season. In most of the situations we are going to handle in this course, the length of that season will be a year.

The seasonal period of a time series is always of greater order than the sampling frequency of the time series.

  • Example 1: with quarterly data, the natural seasonal period to consider is the immediately superior calendar unit, which is a year.
  • Example 2: with monthly data, the natural seasonal periods to consider are the immediately superior calendar units, either a quarter or a year. In most instances we will consider a year when dealing with monthly data.
  • Example 3: with daily data, multiple seasonal periods can be considered: week, month, year…
  • Example 4: with sub-daily data, the day itself can be considered as a seasonal period…

Let us consider again the example of the antidiabetic drug sales in Australia. The dataset PBS contains monthly Medicare Australia prescription data. We are going to select the data corresponding to drugs of ATC2 category A10. For further info run ?PBS in the console.

PBS %>%
  filter(ATC2 == "A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost)) %>%
  
  # Comment on this assignment operator
  mutate(Cost = TotalC / 1e6) -> a10

a10
# A tsibble: 204 x 3 [1M]
      Month  TotalC  Cost
      <mth>   <dbl> <dbl>
 1 1991 Jul 3526591  3.53
 2 1991 Aug 3180891  3.18
 3 1991 Sep 3252221  3.25
 4 1991 Oct 3611003  3.61
 5 1991 Nov 3565869  3.57
 6 1991 Dec 4306371  4.31
 7 1992 Jan 5088335  5.09
 8 1992 Feb 2814520  2.81
 9 1992 Mar 2985811  2.99
10 1992 Apr 3204780  3.20
# ℹ 194 more rows
# Let us remember the time-plot
autoplot(a10, Cost) +
  labs(y = "$ (millions)",
       title = "Australian antidiabetic drug sales") +
  scale_x_yearmonth(breaks = "1 year") +
  theme(axis.text.x = element_text(angle = 90))

In this case we have monthly data, so the natural period to consider is a year, which covers all the calendar variations and is of higher order than the sampling frequency. The time-plot seems to confirm this as well. We will deal with seasonality and periods in more detail in the coming lessons through the concept of autocorrelation.

Now let us produce a seasonal plot using gg_season():

a10 %>%
  gg_season(Cost, labels = "both") + # Labels -> "both", "right", "left"
  labs(y = "$ (millions)",
       title = "Seasonal plot: Antidiabetic drug sales")

As you can see, the length of the x-axis is limited to the duration of a year. Each year is then represented and color-coded using a gradient scale. On this plot we can see that there are jumps in the general level of the series each year. We also see a sharp decline from January to February and a subsequent recovery throughout the rest of the year.

Multiple seasonal periods

The dataset vic_elec, loaded along with fpp3, contains half-hourly electricity demand for the state of Victoria (Australia) for three years. Because data is sampled every half-hour, we can find multiple seasonal patterns in the data:

  • Daily pattern (period 48)
  • Weekly pattern
  • Yearly pattern…

Daily period example

Let us examine when the data begins and ends. A possible way to do this is:

head(vic_elec)
# A tsibble: 6 x 5 [30m] <Australia/Melbourne>
  Time                Demand Temperature Date       Holiday
  <dttm>               <dbl>       <dbl> <date>     <lgl>  
1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE   
tail(vic_elec)
# A tsibble: 6 x 5 [30m] <Australia/Melbourne>
  Time                Demand Temperature Date       Holiday
  <dttm>               <dbl>       <dbl> <date>     <lgl>  
1 2014-12-31 21:00:00  3928.        20.3 2014-12-31 FALSE  
2 2014-12-31 21:30:00  3873.        19   2014-12-31 FALSE  
3 2014-12-31 22:00:00  3792.        18.5 2014-12-31 FALSE  
4 2014-12-31 22:30:00  3725.        17.7 2014-12-31 FALSE  
5 2014-12-31 23:00:00  3762.        17.3 2014-12-31 FALSE  
6 2014-12-31 23:30:00  3809.        17.1 2014-12-31 FALSE  

It appears that the dataset contains measurements of the years 2012, 2013 and 2014 (three years).

It is possible to adjust the seasonal period considered by gg_season (if applicable to the data) through the argument period. For example, we may consider a daily period as follows:

vic_elec %>% gg_season(Demand, period = "day") +
  theme(legend.position = "none") +
  labs(y="MWh", title="Electricity demand: Victoria")

  1. Question: what does each line represent in the data?
  2. Question: how many datapoints do we have?
  3. Question: how many lines do we have in our graph?
  4. Question: how many data points per line do we have?

Solutions to be provided after class:

Weekly period example

vic_elec %>% gg_season(Demand, period = "week") +
  theme(legend.position = "none") +
  labs(y="MWh", title="Electricity demand: Victoria")

  1. Question: what does each line represent in the data?
  2. Question: how many weeks are there in a year?
  3. Question: how many lines do we have in this graph?
  4. Question: how many datapoints does each line have?

Yearly period example

vic_elec %>% gg_season(Demand, period = "year") +
  labs(y="MWh", title="Electricity demand: Victoria")

  1. Question: what does each line represent?
  2. Question: how many datapoints do we have per line?
  3. Question: propose an alternative, less noisy way of representing the data:
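As a hint for the last question, one possibility (a sketch, not the only valid answer) is to aggregate to daily averages before plotting the yearly pattern:

```r
# Aggregate half-hourly data to daily averages, then plot
# the yearly seasonal pattern on the smoother daily series
vic_elec %>%
  index_by(day = as_date(Time)) %>%
  summarise(avg_demand = mean(Demand)) %>%
  gg_season(avg_demand, period = "year") +
  labs(y = "MWh", title = "Average daily electricity demand: Victoria")
```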


Exercise seasonal plots

The aus_arrivals (loaded with the fpp3 library) data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.

Create a:

  • Timeplot of the arrivals using autoplot() (one curve per country of origin)
  • Seasonal plot using gg_season() (due to the structure of the data it will produce one graph per country of origin)
  • A gg_subseries() plot that shows the evolution of the quarterly arrivals per country over the time period considered (see next section on gg_subseries())
aus_arrivals
# A tsibble: 508 x 3 [1Q]
# Key:       Origin [4]
   Quarter Origin Arrivals
     <qtr> <chr>     <int>
 1 1981 Q1 Japan     14763
 2 1981 Q2 Japan      9321
 3 1981 Q3 Japan     10166
 4 1981 Q4 Japan     19509
 5 1982 Q1 Japan     17117
 6 1982 Q2 Japan     10617
 7 1982 Q3 Japan     11737
 8 1982 Q4 Japan     20961
 9 1983 Q1 Japan     20671
10 1983 Q2 Japan     12235
# ℹ 498 more rows
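A possible solution sketch for the three requested plots (run each pipeline separately):

```r
# Timeplot: Origin is the key, so autoplot draws one curve per country
aus_arrivals %>% autoplot(Arrivals)

# Seasonal plot: one panel per country of origin
aus_arrivals %>% gg_season(Arrivals, labels = "both")

# Subseries plot: one mini time plot per quarter, split by country
aus_arrivals %>% gg_subseries(Arrivals)
```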

Seasonal subseries plots: gg_subseries()

Alternative plot to emphasize seasonal patterns. The data for each season are collected together in separate mini time plots, printing one plot for each of the seasonal divisions. All this is best understood through examples:

Example with monthly data

a10 
# A tsibble: 204 x 3 [1M]
      Month  TotalC  Cost
      <mth>   <dbl> <dbl>
 1 1991 Jul 3526591  3.53
 2 1991 Aug 3180891  3.18
 3 1991 Sep 3252221  3.25
 4 1991 Oct 3611003  3.61
 5 1991 Nov 3565869  3.57
 6 1991 Dec 4306371  4.31
 7 1992 Jan 5088335  5.09
 8 1992 Feb 2814520  2.81
 9 1992 Mar 2985811  2.99
10 1992 Apr 3204780  3.20
# ℹ 194 more rows
a10 %>%
  gg_subseries(Cost) +
  labs(
    y = "$ (millions)",
    title = "Australian antidiabetic drug sales",
  )

Why does it produce this output?

Because of the date format: Year-Month.

  • The smallest time unit in the index is taken as the unit into which the period is split. In this case, the month.
  • The period considered is the biggest possible period included in the time index. In this case, the year.

As you can see, specifying “year” as the period is simply redundant in this case.

a10 %>%
  gg_subseries(Cost, period = "year") +
  labs(
    y = "$ (millions)",
    title = "Australian antidiabetic drug sales",
  )

Question: what will the output of this code be?

a10 %>%
  gg_subseries(Cost, period = "month") +
  labs(
    y = "$ (millions)",
    title = "Australian antidiabetic drug sales",
  )

Question what is the output of this code?

a10 %>%
  gg_subseries(Cost, period = "day") +
  labs(
    y = "$ (millions)",
    title = "Australian antidiabetic drug sales",
  )

Example with sub-daily data

NOTE: this example is not extremely important from a mathematical point of view; rather, it helps you understand the logic behind seasonal subseries plots.

head(vic_elec)
# A tsibble: 6 x 5 [30m] <Australia/Melbourne>
  Time                Demand Temperature Date       Holiday
  <dttm>               <dbl>       <dbl> <date>     <lgl>  
1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE   
tail(vic_elec)
# A tsibble: 6 x 5 [30m] <Australia/Melbourne>
  Time                Demand Temperature Date       Holiday
  <dttm>               <dbl>       <dbl> <date>     <lgl>  
1 2014-12-31 21:00:00  3928.        20.3 2014-12-31 FALSE  
2 2014-12-31 21:30:00  3873.        19   2014-12-31 FALSE  
3 2014-12-31 22:00:00  3792.        18.5 2014-12-31 FALSE  
4 2014-12-31 22:30:00  3725.        17.7 2014-12-31 FALSE  
5 2014-12-31 23:00:00  3762.        17.3 2014-12-31 FALSE  
6 2014-12-31 23:30:00  3809.        17.1 2014-12-31 FALSE  

Question: what would be the output of this code? DO NOT RUN THIS CELL. It might crash your computer.

vic_elec %>%
  gg_subseries(Demand) +
  labs(
    y = "Demand [MW]",
    title = "Half-hourly energy demand in the state of Victoria",
  )

It would take the greatest time unit in the index (year) and split it by the smallest time unit in the index (half an hour). Since there are 365 * 24 * 2 = 17520 half-hours in a year, it would create 17520 mini plots. This would be useless, unreadable and would probably crash your computer.

Takeaway: we need to be careful when we want to use the gg_subseries() function with sub-daily data.

A first possibility would be to specify period = "day". This is a good first option with sub-daily data, because a day is the immediately superior time unit when we are dealing with half-hourly data.

Question: what is the output of this code?

vic_elec %>%
  gg_subseries(Demand, period = "day") +
  labs(
    y = "[MW]",
    title = "Total electricity demand for Victoria, Australia"
  )

# Necessary to explore the output given the density of the data
ggsave("subseasonal_demand_hourly_PeriodDay.svg", width = 30, height = 5)

Question: what is the output of this code?

vic_elec_m <- vic_elec %>%
  index_by(month = yearmonth(Time)) %>%
  summarise(
    avg_demand = mean(Demand)
  )

vic_elec_m %>%
  gg_subseries(avg_demand, period = "year") + 
  labs(
    y = "[MW]",
    title = "Total electricity demand for Victoria, Australia"
  )

Question: what is the output of this code?

We could also aggregate by quarter:
vic_elec_q <- vic_elec %>%
  index_by(quarter = yearquarter(Time)) %>%
  summarise(
    avg_demand = mean(Demand)
  )

vic_elec_q %>%
  gg_subseries(avg_demand, period = "year") + 
  labs(
    y = "[MW]",
    title = "Total electricity demand for Victoria, Australia"
  )

Exercise - subseries plot

  1. Aggregate the vic_elec data into daily data computing the average demand for each day using index_by() + summarise() combined with mean()
  2. Create a gg_subseries() plot that has one graph for each day of the month and plots the values on each day for each month only for the year 2012

See expected output on the separate file subseasonal_demand_daily_PeriodMonth.svg
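A possible solution sketch (compare your result with the expected .svg output):

```r
vic_elec %>%
  # 1. Aggregate to daily data: average demand per day
  index_by(day = as_date(Time)) %>%
  summarise(avg_demand = mean(Demand)) %>%
  # 2. Keep only 2012 and split each month by day of the month
  filter(year(day) == 2012) %>%
  gg_subseries(avg_demand, period = "month") +
  labs(y = "[MW]",
       title = "Average daily electricity demand for Victoria, Australia")
```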

Example with daily data

Question: what would be the output of this code?

vic_elec_d <- vic_elec %>%
  
  index_by(day = as_date(Time)) %>%
  
  summarise(
    avg_demand = mean(Demand)
  )

# Filter values for 2012
filter(vic_elec_d, year(day) == 2012) %>%
  
  # Create gg subseries plot
  gg_subseries(avg_demand, period = "year") + 
  labs(
    y = "[MW]",
    title = "Total electricity demand for Victoria, Australia"
  ) 

ggsave("subseasonal_demand_daily_PeriodYear.svg", width = 40, height = 5)